DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon
Authors
Abstract
Finding word boundaries in continuous speech is challenging, as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but relies only on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new state of the art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model to learn semantic and syntactic representations, as assessed by a spoken word embedding benchmark.
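The abstract contrasts DP-Parse's instance lexicon with type-lexicon Dirichlet-process segmenters in the Goldwater et al. tradition. As a rough illustration of that baseline family only (not the authors' DP-Parse implementation), the toy sketch below Gibbs-samples word boundaries for a single symbol string under a unigram Dirichlet-process model; function names such as `gibbs_segment`, the hyperparameter values, and the use of characters instead of speech-derived units are all illustrative assumptions.

```python
import random
from collections import Counter

ALPHA = 1.0     # DP concentration; illustrative value, not from the paper
P_STOP = 0.3    # geometric word-length parameter of the base distribution
ALPHABET = 26   # assumed symbol inventory (lowercase letters in this toy)


def base_prob(word):
    """Base distribution P0(w): geometric length prior, i.i.d. uniform symbols."""
    return P_STOP * (1 - P_STOP) ** (len(word) - 1) * (1.0 / ALPHABET) ** len(word)


def words_from_boundaries(chars, bounds):
    """Cut the symbol string at the active boundary positions."""
    words, start = [], 0
    for i in sorted(bounds) + [len(chars)]:
        words.append(chars[start:i])
        start = i
    return words


def gibbs_segment(chars, n_iters=200, seed=0):
    """Sample word boundaries for one utterance under a unigram DP model."""
    rng = random.Random(seed)
    bounds = {i for i in range(1, len(chars)) if rng.random() < 0.5}
    for _ in range(n_iters):
        for pos in range(1, len(chars)):
            bounds.discard(pos)                  # hypothesis: no boundary here
            words = words_from_boundaries(chars, bounds)
            counts, total = Counter(words), len(words)
            # locate the word spanning `pos` and its two halves
            start = 0
            for w in words:
                if start < pos < start + len(w):
                    whole, left, right = w, w[:pos - start], w[pos - start:]
                    break
                start += len(w)
            counts[whole] -= 1                   # exclude the spanning token
            total -= 1
            # Chinese-restaurant predictive probabilities for the two hypotheses
            p_join = (counts[whole] + ALPHA * base_prob(whole)) / (total + ALPHA)
            p_left = (counts[left] + ALPHA * base_prob(left)) / (total + ALPHA)
            p_right = (counts[right] + (left == right)
                       + ALPHA * base_prob(right)) / (total + 1 + ALPHA)
            p_split = p_left * p_right
            if rng.random() < p_split / (p_split + p_join):
                bounds.add(pos)                  # accept the boundary
    return words_from_boundaries(chars, bounds)


if __name__ == "__main__":
    # Repeated substrings give the sampler reusable tokens to discover
    print(gibbs_segment("thedogsawthedogandthedogran"))
```

In this type-based formulation the predictive probabilities depend on counts over hypothesized word types, which is exactly where clustering errors can accumulate; DP-Parse's instance-lexicon alternative, as described in the abstract, avoids committing to such a type inventory.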
Similar resources
Word Boundaries in French: Evidence from Large Speech Corpora
The goal of this paper is to investigate French word segmentation strategies using phonemic and lexical transcriptions as well as prosodic and part-of-speech annotations. Average fundamental frequency (f0) profiles and phoneme duration profiles are measured using 13 hours of broadcast news speech to study prosodic regularities of French words. Some influential factors are taken into considerati...
Morphological Lexicon Extraction from Raw Text Data
We introduce a tool developed for the automatic extraction of lemma-paradigm pairs from raw text data. The tool combines regular expressions containing variables with propositional logic to form search patterns which identify lemmas tagged with their paradigm class. Furthermore, we describe the underlying algorithm of the tool and suggest a method for developing a morphological lexicon. The...
Parsing with subdomain instance weighting from raw corpora
The treebanks that are used for training statistical parsers consist of hand-parsed sentences from a single source/domain like newspaper text. However, newspaper text concerns different subdomains of language use (e.g. finance, sports, politics, music), which implies that the statistics gathered by generative statistical parsers are averages over subdomain statistics. In this paper we explore a...
Learning the lexicon from raw texts for open-vocabulary Korean word recognition
In this paper, we propose a novel method of building a language model for open-vocabulary Korean word recognition. Due to the complex morphology of Korean, it is inappropriate to use lexicons based on linguistic entities such as words and morphemes in open-vocabulary domains. Instead, we build the lexicon by collecting variable-length character sequences from the raw texts using a dynamic Ba...
Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis
This paper addresses the problem of Hindi compound word splitting and its relevance to developing a good quality phonetizer for Hindi Speech Synthesis. The constituents of a Hindi compound word are not separated by space or hyphen. Hence, most of the existing compound splitting algorithms can not be applied to Hindi. We propose a new technique for automatic extraction of compound words from Hin...
Journal
Journal title: Transactions of the Association for Computational Linguistics
Year: 2022
ISSN: 2307-387X
DOI: https://doi.org/10.1162/tacl_a_00505